Imports and data

In [9]:
import numpy as np
import pandas as pd
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)
pd.options.display.max_columns = None
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df = pd.read_excel('Online Retail.xlsx')
df.head()
Out[9]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.00 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.00 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.00 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.00 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.00 United Kingdom

Data exploration and documentation of data issues

Price

In [7]:
print(df.UnitPrice.min())
print(df.UnitPrice.max())
-11062.06
38970.0

LOL some of our prices are negative. Is this maybe supposed to be returns?

However, we also have negative quantities, and the negative quantities aren't associated with negative price values...

In [9]:
print(df.Quantity.min())
print(df.Quantity.max())
-80995
80995
In [10]:
df[df['Quantity'] == df.Quantity.min()]
Out[10]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
540422 C581484 23843 PAPER CRAFT , LITTLE BIRDIE -80995 2011-12-09 09:27:00 2.08 16446.0 United Kingdom

What was the item that cost ~40k? Does that price make sense?

In [12]:
df[df['UnitPrice'] == df.UnitPrice.max()]
Out[12]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
222681 C556445 M Manual -1 2011-06-10 15:31:00 38970.0 15098.0 United Kingdom

Not sure what "Manual" means. Here again we see negative quantities. Maybe it makes sense if we look at it in the context of all their transactions?

In [14]:
df[df['CustomerID'] == 15098]
Out[14]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
222670 556442 22502 PICNIC BASKET WICKER SMALL 60 2011-06-10 15:22:00 4.95 15098.0 United Kingdom
222680 556444 22502 PICNIC BASKET WICKER 60 PIECES 60 2011-06-10 15:28:00 649.50 15098.0 United Kingdom
222681 C556445 M Manual -1 2011-06-10 15:31:00 38970.00 15098.0 United Kingdom
222682 556446 22502 PICNIC BASKET WICKER 60 PIECES 1 2011-06-10 15:33:00 649.50 15098.0 United Kingdom
222692 C556448 22502 PICNIC BASKET WICKER SMALL -60 2011-06-10 15:39:00 4.95 15098.0 United Kingdom

It looks like the negative quantities are returns of previous purchases. Maybe the negative prices we saw before are bad data?

How many customers do we have?

In [16]:
len(df.CustomerID.unique())
Out[16]:
4373

How many invoices do we have?

In [17]:
len(df.InvoiceNo.unique())
Out[17]:
25900

missing data

In [4]:
df.isna().sum()
Out[4]:
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

What date range are we covering?

In [11]:
print(df.InvoiceDate.min())
print(df.InvoiceDate.max())
2010-12-01 08:26:00
2011-12-09 12:50:00

Data visualization

In [16]:
country = df.groupby('Country').agg({'InvoiceNo':'count'}).reset_index()
country.columns = ['country', 'number_invoices']
country.head()
Out[16]:
country number_invoices
0 Australia 1259
1 Austria 401
2 Bahrain 19
3 Belgium 2069
4 Brazil 32
In [29]:
COLORS = ['rgba(155, 69, 242, 1)', 'rgba(242, 69, 155,1)', 'rgba(69, 155, 242, 1)', 'rgba(69, 242, 69,1)', 
          'rgba(247, 133, 19, 1)', 'rgba(19, 247, 186, 1)', 'rgba(87, 19, 247, 1)', 'rgba(212, 19, 247, 1)', 
          'rgba(226, 244, 22, 1)', 'rgba(244, 195, 22, 1)', 'rgba(166,252,229,1)', 'rgba(234,227,252,1)',
          'rgba(246,209,252,1)', 'rgba(244,251,168,1)', 'rgba(251,233,168,1)', 'rgba(218,186,250,1)'
          'rgba(248,169,209,1)', 'rgba(170,209,248,1)', 'rgba(185,250,187,1)', 'rgba(252,209,169,1)'
          'rgba(214,62,79,1)', 'rgba(245,108,67,1)', 'rgba(84,37,132,1)', 'rgba(132,37,85,1)'
          'rgba(44,99,154,1)', 'rgba(3,101,94,1)', 'rgba(158,1,66,1)', 'rgba(1,60,49,1)'
          'rgba(162,162,162,1)', 'rgba(88,6,51,1)', 'rgba(53,151,142,1)', 'rgba(70,70,70,1)',
          'rgba(138,74,108,1)']

fig = {
'data': [
{
  'values': list(country['number_invoices']),
  'labels': list(country['country']),

  'textposition':'inside',
  'hole': .4,
  'type': 'pie',
'marker': {'colors': COLORS[0:len(list(country['country']))]}
}],
'layout': {
    'title': 'Invoices by country'
}
}
iplot(fig)
In [43]:
data = [go.Histogram(x=df.Quantity, marker=dict(color='#EB89B5'))]

layout = go.Layout(
    title='Quantity Histogram'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)
In [51]:
zoom = df[df['Quantity'] < 200]
zoom = zoom[zoom['Quantity'] > 0]
data = [go.Histogram(x=zoom.Quantity, marker=dict(color='#EB89B5'), nbinsx = 50)]

layout = go.Layout(
    title='Zoomed in Quantity Histogram'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)
In [46]:
data = [go.Histogram(x=df.UnitPrice, marker=dict(color=COLORS[3]))]

layout = go.Layout(
    title='Unit Price Histogram'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)
In [52]:
zoom = df[df['UnitPrice'] < 100]
zoom = zoom[zoom['UnitPrice'] > 0]
data = [go.Histogram(x=zoom.UnitPrice, marker=dict(color=COLORS[3]), nbinsx = 50)]

layout = go.Layout(
    title='Zoomed In Unit Price Histogram'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)
In [66]:
# number of unique items purchased per customer
unique = df[['Description', 'CustomerID']]
unique = unique.drop_duplicates()

unique_items_per_customer = unique.groupby('CustomerID').agg({'Description': 'count'}).reset_index()
unique_items_per_customer.columns = ['customer_id', 'unique_items']
unique_items_per_customer.head()
Out[66]:
customer_id unique_items
0 12346.00 1
1 12347.00 103
2 12348.00 22
3 12349.00 73
4 12350.00 17
In [67]:
data = [go.Histogram(x=unique_items_per_customer.unique_items, marker=dict(color=COLORS[4]))]

layout = go.Layout(
    title='Number of Unique Items Purchased Per Customer'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)